Training certified detectives to track down the intrinsic shortcuts in COVID-19 chest x-ray data sets

Deep learning faces a significant challenge wherein the trained models often underperform when used with external test data sets. This issue has been attributed to spurious correlations between irrelevant features in the input data and corresponding labels. This study uses the classification of COVID-19 from chest x-ray radiographs as an example to demonstrate that the image contrast and sharpness, which are characteristics of a chest radiograph dependent on data acquisition systems and imaging parameters, can be intrinsic shortcuts that impair the model’s generalizability. The study proposes training certified shortcut detective models that meet a set of qualification criteria which can then identify these intrinsic shortcuts in a curated data set.


Introduction
Deep learning has been incredibly successful in object detection, image classi cation, and natural language processing over the past ten years due to its ability to learn complex features from data. However, despite its success on benchmark datasets, there are limitations and practical issues when using these models in real-world scenarios. A major challenge is poor generalizability, where performance signi cantly drops when applied to external datasets [1][2][3] . This limits the translation and deployment of deep learning models for high-stakes tasks, such as in healthcare applications. This lesson was evident during the COVID-19 pandemic, where many machine learning models were developed, but very few performed well on real-world clinical tests [4][5][6] .
The concept of shortcut learning has recently been explored in deep learning studies 7 . It has been discovered that poor model generalizability can be attributed to shortcut learning when the training dataset has hidden shortcuts, meaning there are spurious correlations between irrelevant image features and the corresponding training labels. This causes models to quickly pick up these spurious correlations instead of the desired image features, establishing incorrect connections between input image data and output labels. For instance, early studies have shown that deep learning models can differentiate chest xrays from different hospitals and patient groups 1 . This suggests that different data sources and patient characteristics like gender, age, and race could also become shortcuts.
To illustrate the issue of shortcut learning in real-world clinical scenarios, let's take the example of COVID-19 classi cation using chest x-ray radiographs (CXRs). DeGrave et al. 8 discovered that if COVID-19 positive and negative training data were collected from two different sources, the model would only learn the source label as a shortcut for prediction. As a result, the model would not have the desired prediction power in real-world clinical scenarios. The authors also found that the trained model used extrinsic image features, such as lead markers, for prediction, even though these markers only indicated the orientation of patients in x-ray image acquisitions and did not correspond to any disease features. Some suggestions have been made to remove these extrinsic shortcuts by a segmentation step 9 . However, even with these markers removed from the training dataset, some other shortcuts still existed in the segmented lung tissue-only training dataset 10 . As a result, the nearly perfect performance of the trained model 9 is still not generalizable to real-world clinical datasets. The remaining shortcuts may be attributed to the inherent de ning features of CXRs, such as image contrast and sharpness, which can vary from hospital to hospital due to different types of imaging systems, generations of x-ray imaging equipment, hardware components, image post-processing methods used by vendors, and imaging protocols used by technologists. All these factors can impact the digital representation of the acquired image data in terms of variations in image contrast and sharpness. Figure 1 demonstrates these variations.
Unlike other shortcuts that have been previously studied, such as age, gender, race, and markers, which are extrinsic and can be removed through careful data collection and cleaning, contrast-and sharpnessrelated shortcuts are more di cult to detect and mitigate. This is because desired image features are also represented as image contrast and spatial correlations, which are similar to the features of contrast-and sharpness-related shortcuts. This entanglement between desired image features and shortcut features makes studying contrast and sharpness-related shortcuts particularly challenging. Consequently, these shortcuts are referred to as intrinsic shortcuts in this work.
In order to develop an effective strategy for mitigating contrast-and sharpness-related shortcuts, it is necessary to rst develop reliable methods for detecting their presence and severity within a carefully curated dataset. While post-hoc model interpretability methods, such as class activation maps 11 and expected gradient 12 , have been developed to identify relevant image features used by trained deep learning models for prediction, these methods are unable to detect intrinsic shortcuts within a curated training dataset prior to model training. Furthermore, studies have suggested that these methods may not be effective in diagnosing poor generalization performance of the model 13 .
In this paper, we present a novel approach for detecting contrast-and sharpness-related intrinsic shortcuts using certi ed shortcut detective models. Our approach involves establishing quali cation standards for suspected intrinsic shortcuts, designing a training curriculum for training the shortcut detectives to detect these shortcuts, performing certi cation tests on the trained detectives, and nally deploying them to curated datasets to examine the suspected shortcuts. We applied this approach to the available COVID-19 datasets to assess their quality. Our results demonstrate the effectiveness of this approach in detecting and mitigating intrinsic shortcuts. Figure 2 provides an overview of the datasets utilized in this study. The MIMIC-CXR dataset served as the training data for the shortcut detectives, while the HF-train dataset, a privately curated COVID-19 chest xray dataset, was utilized for certi cation tests of the trained detectives. The trained shortcut detectives were then applied to a variety of public and private chest x-ray datasets. Speci c information about each dataset is provided below.

RoentGen-MIMIC dataset
This dataset contains 943 synthetic CXRs generated by the RoentGen model 17 and 1,000 real "No Finding" CXRs from the MIMIC dataset. The RoentGen model, which is trained using the MIMIC dataset, is able to generate visually convincing synthetic CXRs with different pathologies. To generate the synthetic CXRs used in this work, a text prompt of "No nding" was used as the input, only frontal view CXRs are included.
Training and certi cation of shortcut detectives Overview of the proposed framework An overview of the shortcut detective training and certi cation process is shown in Fig. 3.
To begin, 46,894 normal CXRs from the MIMIC dataset were selected and randomly divided into two equal groups: 23,447 of the CXRs were assigned as the positive class ("1") while the rest were assigned as the negative class ("0").
Then, to construct the training dataset for the shortcut detective, the image contrast or the image sharpness of the positive class were adjusted using the approach shown in Appendix A1. Since only normal CXRs were included, there were no disease-speci c features present that could be used to distinguish between the two classes. Therefore, if a model was able to differentiate between the two classes, it would be because the model had learned the corresponding global image contrast or sharpness characteristics, rather than disease features.
Subsequently, shortcut detectives (neural network models for binary classi cation) were trained using the constructed training datasets with contrast or sharpness shortcuts. Details of the training are discussed in the following section and in Appendix A2.
To assess the e cacy of the shortcut detectives on COVID-19 CXR dataset, two types of examinations are necessary. Firstly, when the shortcut detective is deployed on a COVID-CXR dataset without the corresponding shortcut, it should not be able to differentiate between the COVID-positive and COVIDnegative classes. Hence, an Area Under the Receiver Operating Characteristics curve (AUC) close to 0.5 is expected. This examination is crucial to ensure that the image features utilized by the shortcut detectives are not interwoven with the original imaging task. Secondly, when the shortcut detective is applied to a COVID-CXR dataset with a known shortcut, it should demonstrate superior classi cation performance, ideally with an AUC close to 1. For the rst examination, the HF-train dataset is utilized. As the positive and negative cohorts are gathered within the same timeframe and from the same hospitals, it is expected that no contrast and sharpness shortcut exists. This has also been corroborated recently 18 , where a model trained on this dataset demonstrated consistent test performance on various external COVID-19 clinical test datasets. For the second examination, known shortcuts are integrated into the COVID-positive class or COVID-negative class of the HF-train dataset using the same procedures outlined in Appendix A1.
Finally, if the trained shortcut detectives pass the two exams, i.e. AUC of close to 0.5 on the shortcut-free dataset and AUC of close to 1.0 on the shortcut-present dataset, they are referred to as certi ed shortcut detectives. Image preprocessing and model architecture For the MIMIC, HF, BIMCV, UW Health and MIDRC datasets, the original DICOM images are converted to 8bit png format with a size of 224-by-224 using the default window level and window width. For the COVIDx dataset, images are directly resized to 224-by-224.
To train the shortcut detectives, ve different model architectures ( Table 1)

Results
Certi cation of the shortcut detectives As shown in Table 2, the trained detectives successfully passed the two exams. Note that an AUC of 0.0 is simply due to the assignment of class labels, which is equivalent to an AUC of 1.0. Both indicate perfect classi cation performance. It is also shown in Table 2 that all ve neural network architectures achieve a similar performance level. This result aligns well with our understanding that contrast and sharpness shortcuts are intrinsic to the dataset and the task.

Shortcut detection in real-world datasets
Using the certi ed shortcut detectives, we investigated several curated COVID-CXR datasets for possible shortcuts. UW Health is a single-site, privately curated dataset; BIMCV and MIDRC are both multiinstitutional public datasets; COVIDx is the rst open access COVID-CXR dataset made available by the experts in the computer science community.
Similar to the detective certi cation process, for a COVID-CXR dataset where COVID-19-positive cases are assigned a label "1" and negative cases are given a label "0", if the shortcut detective can differentiate the two classes (AUC signi cantly deviates from 0.50), it indicates the existence of a corresponding shortcut. For the RoentGen-MIMIC dataset, real CXRs are labelled as "1" and synthetic CXRs are labelled as "0".
The results presented in Table 3 were obtained using shortcut detectives based on the DenseNet model. The results con rm the presence of image sharpness and contrast shortcuts in the COVIDx dataset, which can be exploited by models trained on such data and compromise their generalizability in real clinical settings. Conversely, the other three datasets that were curated by medical professionals exhibit no such shortcuts. We conducted a performance comparison between two models trained on COVIDx and Henry Ford datasets, respectively, which are of similar size, but the former has both sharpness and contrast shortcuts as identi ed by the shortcut detectives. The results displayed in Table 4 indicate that the COVIDx model exhibits poor generalization performance, as evidenced by the large AUC gap between internal and external tests. In contrast, the HF model exhibits consistent performance on both internal and external tests. Additionally, as shown in Table 3, some contrast and sharpness differences were also detected in the RoentGen-MIMIC dataset. While the generated synthetic CXRs appear visually realistic, caution must be exercised when using them for AI model development due to the potential learning of shortcuts caused by the inherent contrast and sharpness differences between real and synthetic data. Note: (+) means the shortcut is added to the COVID-positive CXRs; (-) means the shortcut is added to the COVID-negative CXRs.

Discussion
Shortcut learning has been a topic of interest in the machine learning community, particularly in computer vision (CV) and natural language processing (NLP). Researchers have explored shortcut learning behavior from different perspectives, such as underspeci cation 3 , shortcut learning in various NLP tasks 25 , and mitigation strategies for domain-knowledge agnostic models 26 . Notably, it has been observed that convolutional neural networks tend to rely on content with high spatial frequency or strong local correlations to establish connections between input and labels in CV and NLP 27,28 . However, it remains unclear whether these observations can be extended to medical diagnosis tasks, where clinical datasets have distinct characteristics from those in ImageNet. Therefore, studying shortcut learning for wellde ned, clinically relevant tasks using real-world clinical datasets is crucial for medical AI applications.
In this study, we demonstrated that acquisition-dependent attributes (ADAs), such as image contrast and sharpness differences arising from the entire image generation pipeline, can serve as intrinsic shortcuts during clinical diagnosis learning tasks. Inadequate quality control procedures during data collection can allow these shortcuts to inadvertently in ltrate the curated dataset. If shortcuts contaminate the dataset, neural networks can easily exploit them during training, thereby impairing their ability to generalize to other real-world datasets.
Thus, it is imperative to identify possible shortcuts in the training dataset prior to model development. In this study, we present a methodical framework for training and validating shortcut detectives for chest Xray classi cation, with emphasis on image contrast and sharpness -two essential intrinsic characteristics of chest X-ray images. However, it should be noted that these are not the only possible shortcuts that may exist in chest X-ray datasets. If other intrinsic shortcuts are suspected in a collected dataset, the general framework presented in this work can be utilized to construct similar shortcut detectives and identify alleged intrinsic shortcuts.
Once the intrinsic shortcuts are identi ed, it is essential to develop strategies to mitigate their impact on the learned models. One possible approach is to develop standardization and normalization techniques for image contrast and sharpness to adjust these attributes without affecting the disease features. Alternatively, examining proven intrinsic shortcut-free datasets, such as the baseline dataset (Henry Ford Health) and the three additional datasets (UW Health, BIMCV, and MIDRC) shown to be free of intrinsic shortcuts in this study, can provide further insight on how to avoid these shortcuts in the data curation process. However, it is worth noting that this is a limitation of the present work, and future research should investigate the development of mitigation strategies for the identi ed intrinsic shortcuts.  Acquisition Dependent Attributes (ADA) in chest x-ray images.

Figure 2
An overview of the datasets used in this work.

Figure 3
A framework to train and certify shortcut detectives.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.